CONFIDENCE INTERVALS FOR LOW-DIMENSIONAL PARAMETERS 
WITH HIGH-DIMENSIONAL DATA

Abstract. The purpose of this paper is to propose methodologies for statistical inference
of low-dimensional parameters with high-dimensional data. We focus on constructing
confidence intervals for individual coefficients and linear combinations of several of them in a
linear regression model, although our ideas are applicable in a much broader context. The
theoretical results presented here provide sufficient conditions for the asymptotic normality
of the proposed estimators along with a consistent estimator for their finite-dimensional
covariance matrices. These sufficient conditions allow the number of variables to far
exceed the sample size. The simulation results presented here demonstrate the accuracy
of the coverage probability of the proposed confidence intervals, strongly supporting the
theoretical results.

1. Introduction

High-dimensional data is an intense area of research in statistics and machine learning, 
due to the rapid development of information technologies and their applications in scientific 
experiments and everyday life. Enormous amounts of large complex datasets have been col- 
lected and are waiting to be analyzed; meanwhile, an enormous effort has been mounted in 
order to meet this challenge by researchers and practitioners in statistics, computer science, 
and other disciplines. A great number of statistical methods, algorithms, and theories have 
been developed for the prediction and classification of future outcomes, the estimation of 
high-dimensional objects for different purposes, and the selection of important variables or
features for further scientific experiments and engineering applications. However, statistical 
inference with high-dimensional data is still a largely untouched territory due to the com- 
plexity of the sampling distributions of existing estimators. This is particularly the case in 
the context of the so-called large-p-smaller-n problem, where the dimension of the data is
greater than the sample size. 

Linear regression is one of the best understood statistical models in high-dimensional
data. Important work has been done in problem formulation, development, and analysis
of methodologies and algorithms, and theoretical understanding of their performance
under sparsity assumptions. This includes $\ell_1$ regularized methods and their analysis [Tib96,
CDS01, CT05, GR04, Gre06, MB06, Tro06, ZY06, Wai09, CT07, ZH08, MY09, BRT09,
Kol09, Zha09, vdGB09, YZ10, KLT11, SZ11], nonconvex penalized methods [FF93, FL01,
FP04, ZL08, KCO08, Zha10, ZZ11], greedy methods [Zha11a], adaptive methods [Zou06,
HMZ08, Zha11b, ZZ11], screening methods [FL08], and more. For further discussion, we
refer to related sections in the recent book [BvdG11] and recent reviews in [FL10, ZZ11].






Confidence Intervals 



Among existing results, variable selection consistency is most relevant to statistical infer- 
ence. An estimator is variable selection consistent if it selects the oracle model composed 
of exactly the set of variables with nonzero regression coefficients. In the large-p-smaller-n 
setting, variable selection consistency has been established under incoherence and other 
$\ell_\infty$-type conditions on the design matrix for the Lasso [MB06, Tro06, ZY06, Wai09], and
under sparse eigenvalue or $\ell_2$-type conditions for nonconvex methods [FP04, Zha11a, Zha10,
Zha11b, ZZ11]. Another approach in variable selection with high-dimensional data is
subsampling or randomization methods, notably the stability selection method proposed in
[MB10]. Since the oracle model is typically and nearly necessarily assumed to be of smaller
order in dimension than the sample size n in selection consistency theory, consistent variable 
selection allows a great reduction of the complexity of the analysis from a large-p-smaller-n 
problem to one involving the oracle set of variables only. Consequently, taking the least 
squares estimator on the selected set of variables if necessary, statistical inference can be 
justified in the smaller oracle model. However, statistical inference based on selection
consistency theory typically requires that all nonzero regression coefficients be greater than a
noise level inflated to take model uncertainty into account. This assumption of either none
or uniformly strong signal strength for all individual variables is, unfortunately, seldom 
supported by either the data or the underlying science, especially in biological and medical 
applications. 

2. Methodology 

We develop methodologies and algorithms for the construction of confidence intervals for 
the individual regression coefficients and their linear combinations in the linear model 

(1) $y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$,

where $y \in \mathbb{R}^n$ is a response vector, $X = (x_1, \ldots, x_p) \in \mathbb{R}^{n \times p}$ is a design matrix with columns
$x_j$, and $\beta = (\beta_1, \ldots, \beta_p)^T$ is a vector of unknown regression coefficients. The design matrix
$X$ is assumed to be deterministic throughout the paper, except in Subsection 3.3.

The following notation will be used. For vectors $v = (v_1, \ldots, v_m)$ of any dimension,
$\mathrm{supp}(v) = \{j : v_j \ne 0\}$, $|v|_0 = |\mathrm{supp}(v)| = \#\{j : v_j \ne 0\}$, and $|v|_q = (\sum_j |v_j|^q)^{1/q}$, with the
usual extension to $q = \infty$. For $A \subseteq \{1, \ldots, p\}$, $v_A = (v_j, j \in A)^T$ and $X_A = (x_k, k \in A)$,
including $A = -j = \{1, \ldots, p\} \setminus \{j\}$.

2.1. Bias corrected linear estimators. In the classical theory of linear models, the least
squares estimator of an estimable regression coefficient $\beta_j$ can be written as

(2) $\hat\beta_j^{(lse)} := (x_j^\perp)^T y / |x_j^\perp|_2^2$,

where $x_j^\perp$ is the projection of $x_j$ to the orthogonal complement of the column space of
$X_{-j} = (x_k, k \ne j)$. For estimable $\beta_j$ and $\beta_k$,

(3) $\mathrm{Cov}(\hat\beta_j^{(lse)}, \hat\beta_k^{(lse)}) = \sigma^2 (x_j^\perp)^T x_k^\perp / (|x_j^\perp|_2^2 \, |x_k^\perp|_2^2)$.

In the high-dimensional case $p > n$, $\mathrm{rank}(X_{-j}) = n$ for all $j$ when $X$ is in general
position. Consequently, $x_j^\perp = 0$ and (2) is undefined. However, it may still be interesting
to preserve certain properties of the least squares estimator. One advantage of (2) is the
explicit formula (3) of the covariance structure. This feature holds for all linear estimators
of $\beta$. For any score vector $z_j$ not orthogonal to $x_j$, the corresponding linear estimator
satisfies

$\hat\beta_j^{(lin)} = \frac{z_j^T y}{z_j^T x_j} = \beta_j + \frac{z_j^T \varepsilon}{z_j^T x_j} + \sum_{k \ne j} \frac{z_j^T x_k \beta_k}{z_j^T x_j}$,

with a similar covariance structure to (3). A problem with such a linear estimator is its
bias. For every $k \ne j$ with $z_j^T x_k \ne 0$, the contribution of $\beta_k$ to the bias is linear in $\beta_k$. In
the worst case scenario where $\mathrm{sgn}(\beta_k) = \mathrm{sgn}(z_j^T x_k)$ for all $k$, the bias of $\hat\beta_j^{(lin)}$ may exceed
the order of existing $\ell_1$ error bounds for the estimation of the entire vector $\beta$. We note that
for $\mathrm{rank}(X_{-j}) = n$, it is impossible to have $z_j \ne 0$ and $z_j^T x_k = 0$ for all $k \ne j$, so that
bias is unavoidable. Still, this simple inspection of the bias of the linear estimator suggests
a bias correction with an initial estimator $\hat\beta^{(init)}$:

(4) $\hat\beta_j = \hat\beta_j^{(lin)} - \sum_{k \ne j} \frac{z_j^T x_k \hat\beta_k^{(init)}}{z_j^T x_j}$,

with a score vector $z_j$ depending on $X$ only. One may also interpret (4) as a one-step self
bias correction from the initial estimator,

$\hat\beta_j := \hat\beta_j^{(init)} + \frac{z_j^T (y - X \hat\beta^{(init)})}{z_j^T x_j}$.

The estimation error of (4) can be decomposed as a sum of noise and approximation error:

(5) $\hat\beta_j - \beta_j = \frac{z_j^T \varepsilon}{z_j^T x_j} + \frac{1}{z_j^T x_j} \sum_{k \ne j} z_j^T x_k (\beta_k - \hat\beta_k^{(init)})$.

A full description of (4) still requires specifications of the score vector $z_j$ and the initial
estimator $\hat\beta^{(init)}$. These choices will be discussed in the following two subsections.
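
As an illustration only, the following minimal Python sketch checks numerically that the bias-corrected linear form (4) and the one-step form agree. The function names, the toy data, the crude initial estimate, and the choice of the score vector as the column $x_j$ itself are all hypothetical and are not the choices recommended in this paper.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column(X, k):
    return [row[k] for row in X]


def ldpe_linear_form(X, y, z, beta_init, j):
    """Form (4): the linear estimator z'y / z'x_j minus the plug-in bias term."""
    zxj = dot(z, column(X, j))
    lin = dot(z, y) / zxj
    bias = sum(dot(z, column(X, k)) * beta_init[k]
               for k in range(len(beta_init)) if k != j) / zxj
    return lin - bias


def ldpe_one_step(X, y, z, beta_init, j):
    """One-step form: beta_init[j] + z'(y - X beta_init) / z'x_j."""
    resid = [yi - dot(row, beta_init) for row, yi in zip(X, y)]
    return beta_init[j] + dot(z, resid) / dot(z, column(X, j))


random.seed(0)
n, p, j = 8, 12, 3                       # toy sizes with p > n
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [1.0 if k == j else 0.0 for k in range(p)]
y = [dot(row, beta) + 0.1 * random.gauss(0, 1) for row in X]
beta_init = [0.5 * b for b in beta]      # hypothetical crude initial estimate
z = column(X, j)                         # hypothetical score vector: x_j itself
est_a = ldpe_linear_form(X, y, z, beta_init, j)
est_b = ldpe_one_step(X, y, z, beta_init, j)
```

The two values coincide up to rounding because the term $z_j^T x_j \hat\beta_j^{(init)}$ cancels when the residual $y - X\hat\beta^{(init)}$ is expanded.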



2.2. Low-dimensional projections. A proper choice of $z_j$ should control both the noise
and approximation error terms in (5), given suitable conditions on $\{X, \beta\}$ and an initial
estimator $\hat\beta^{(init)}$. Recall that $z_j$ aims to play the role of $x_j^\perp$, the projection of $x_j$ to the
orthogonal complement of the column space of $X_{-j} = (x_k, k \ne j)$. When $|x_j^\perp|_2$ is not
too small, we may simply take $z_j = x_j^\perp$. When $|x_j^\perp|_2$ is too small, e.g. $x_j^\perp = 0$ when
$\mathrm{rank}(X_{-j}) = n$, we may use a $z_j$ proportional to a relaxed projection of $x_j$. Since $z_j$ is
a relaxed projection of $x_j$ and the estimator (4) is given by a bias-corrected projection of
$y$ to the direction of $z_j$, hereafter we will call (4) the low-dimensional projection estimator
(LDPE) for easy reference.

The projection $x_j^\perp$ is the residual of the least squares fit of $x_j$ on $X_{-j}$. A familiar
relaxation of the least squares method is to add an $\ell_1$ penalty. This leads to the choice of
$z_j$ as the residual of the Lasso:

(6) $z_j = x_j - X_{-j} \hat\gamma_{-j}$, $\hat\gamma_{-j} = \arg\min_b \{ |x_j - X_{-j} b|_2^2 / (2n) + \lambda_j |b|_1 \}$.

Explicit choices of $\lambda_j$ are described in the next subsection. A rationale for the use of a
common penalty level $\lambda_j$ for all components of $b$ in (6) is the normalization of all variables
to $|x_k|_2^2 = n$. In an alternative in Subsection 2.3 called restricted LDPE, the penalty is set
to zero for certain components of $b$ in (6).

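It follows from the Karush-Kuhn-Tucker conditions for (6) that $|x_k^T z_j / n| \le \lambda_j$ for all $k \ne j$, so that

(7) $\big| \sum_{k \ne j} z_j^T x_k (\hat\beta_k^{(init)} - \beta_k) \big| \le \max_{k \ne j} |z_j^T x_k| \, |\hat\beta^{(init)} - \beta|_1 \le n \lambda_j |\hat\beta^{(init)} - \beta|_1$.

This of course is a conservative bound. Let $\eta_j = n \lambda_j / |z_j|_2$ and $\tau_j = |z_j|_2 / |z_j^T x_j|$. Since
$z_j^T \varepsilon \sim N(0, \sigma^2 |z_j|_2^2)$, it follows from (5) that

(8) $\eta_j |\hat\beta^{(init)} - \beta|_1 / \sigma = o(1) \ \Rightarrow\ \tau_j^{-1} (\hat\beta_j - \beta_j) \approx N(0, \sigma^2)$.

Sufficient conditions for $\eta_j |\hat\beta^{(init)} - \beta|_1 / \sigma = o(1)$ will be given in the next section.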
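
The construction in (6) and the bound used in (7) can be illustrated with a small self-contained sketch. It assumes a plain coordinate descent solver written for this example (the helper `lasso_cd`, the toy dimensions, and the penalty level 0.3 are hypothetical choices, not part of the procedure described in this paper). It computes the Lasso residual $z_j$, checks the Karush-Kuhn-Tucker bound $|x_k^T z_j / n| \le \lambda_j$, and evaluates the factors $\eta_j$ and $\tau_j$.

```python
import math
import random


def lasso_cd(X, y, lam, n_sweeps=500, tol=1e-12):
    """Coordinate descent for min_b |y - X b|_2^2 / (2n) + lam * |b|_1.
    Returns the coefficients b and the residual y - X b."""
    n, p = len(X), len(X[0])
    cols = [[X[i][k] for i in range(n)] for k in range(p)]
    sq = [sum(v * v for v in c) / n for c in cols]
    b, r = [0.0] * p, list(y)
    for _ in range(n_sweeps):
        biggest_move = 0.0
        for k, c in enumerate(cols):
            # correlation of column k with the partial residual
            rho = sum(ci * ri for ci, ri in zip(c, r)) / n + sq[k] * b[k]
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / sq[k]
            step = new - b[k]
            if step != 0.0:
                r = [ri - step * ci for ri, ci in zip(r, c)]
                b[k] = new
                biggest_move = max(biggest_move, abs(step))
        if biggest_move < tol:
            break
    return b, r


random.seed(1)
n, p, j, lam = 30, 40, 0, 0.3            # hypothetical toy setting with p > n
raw = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# normalise the columns so that |x_k|_2^2 = n
norms = [math.sqrt(sum(raw[i][k] ** 2 for i in range(n))) for k in range(p)]
X = [[raw[i][k] * math.sqrt(n) / norms[k] for k in range(p)] for i in range(n)]
x_j = [row[j] for row in X]
X_minus_j = [row[:j] + row[j + 1:] for row in X]

gamma, z = lasso_cd(X_minus_j, x_j, lam)          # z is the Lasso residual in (6)
norm_z = math.sqrt(sum(v * v for v in z))
corr = [abs(sum(row[k] * zi for row, zi in zip(X_minus_j, z))) / n
        for k in range(p - 1)]
kkt_gap = max(corr) - lam                # should be <= 0 up to solver tolerance
eta = n * max(corr) / norm_z             # bias factor
tau = norm_z / abs(sum(a * b for a, b in zip(x_j, z)))   # standard error factor
```

A general-purpose Lasso solver could be substituted for `lasso_cd`, provided its objective uses the same $1/(2n)$ scaling of the squared error.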

2.3. Implementation with the scaled Lasso. We describe specific implementations using
the scaled Lasso [Ant10, SZ10, SZ11] to provide the initial estimator $\hat\beta^{(init)}$, the noise level
estimator $\hat\sigma$, and the score vectors $z_j$.

The scaled Lasso is a joint convex minimization method given by

(9) $\{\hat\beta^{(init)}, \hat\sigma\} = \arg\min_{b, \sigma} \{ |y - Xb|_2^2 / (2 \sigma n) + \sigma / 2 + \lambda_0 |b|_1 \}$,

with a penalty level $\lambda_0$. This automatically provides an estimate of the noise level in addition
to the initial estimator of $\beta$. We use $\lambda_0 = \lambda_{univ} = \sqrt{(2/n) \log p}$ in our simulation study.
Existing error bounds for the estimation of both $\beta$ and $\sigma$ require $\lambda_0 = A \sqrt{(2/n) \log(p/\epsilon)}$
with certain $A > 1$ and $0 < \epsilon < 1$ [SZ11].
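
One common way to carry out the joint minimization in (9) is to alternate between the two blocks of variables: for fixed $b$ the minimizer over $\sigma$ is $|y - Xb|_2 / \sqrt{n}$, and for fixed $\sigma$ the minimizer over $b$ is a Lasso with penalty $\sigma \lambda_0$. The sketch below follows this alternating scheme as an assumption about how (9) might be computed; the helper names, stopping rules, and toy data are hypothetical.

```python
import math
import random


def lasso_cd(X, y, lam, n_sweeps=300, tol=1e-12):
    """Coordinate descent for min_b |y - X b|_2^2 / (2n) + lam * |b|_1."""
    n, p = len(X), len(X[0])
    cols = [[X[i][k] for i in range(n)] for k in range(p)]
    sq = [sum(v * v for v in c) / n for c in cols]
    b, r = [0.0] * p, list(y)
    for _ in range(n_sweeps):
        biggest_move = 0.0
        for k, c in enumerate(cols):
            rho = sum(ci * ri for ci, ri in zip(c, r)) / n + sq[k] * b[k]
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / sq[k]
            step = new - b[k]
            if step != 0.0:
                r = [ri - step * ci for ri, ci in zip(r, c)]
                b[k] = new
                biggest_move = max(biggest_move, abs(step))
        if biggest_move < tol:
            break
    return b, r


def scaled_lasso(X, y, lam0, n_outer=30, tol=1e-8):
    """Alternating minimization for (9); returns (beta_init, sigma_hat)."""
    n = len(y)
    sigma = math.sqrt(sum(v * v for v in y) / n)    # start from the null model
    for _ in range(n_outer):
        b, r = lasso_cd(X, y, sigma * lam0)         # beta-step at penalty sigma * lam0
        new_sigma = math.sqrt(sum(v * v for v in r) / n)   # sigma-step
        converged = abs(new_sigma - sigma) < tol
        sigma = new_sigma
        if converged:
            break
    return b, sigma


random.seed(2)
n, p = 20, 30                                       # hypothetical toy sizes
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
beta = [2.0, -2.0] + [0.0] * (p - 2)
y = [sum(a * b for a, b in zip(row, beta)) + random.gauss(0, 1) for row in X]
lam_univ = math.sqrt(2.0 / n * math.log(p))
beta_init, sigma_hat = scaled_lasso(X, y, lam_univ)
```

By construction the returned noise level equals the root mean squared residual of the returned coefficient vector.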

The scaled Lasso can also be used to determine $\lambda_j$ for the $z_j$ in (6). However, the
penalty level for the scaled Lasso, set to guarantee performance bounds for the estimation
of regression coefficients and noise level, may not be the best for controlling the bias and
the standard error of the LDPE. In (7) and (8), we use $\eta_j = n \lambda_j / |z_j|_2$ to bound the
ratio between the bias and standard error of $\hat\beta_j$ and $\tau_j = |z_j|_2 / |z_j^T x_j|$ to approximate the
standard error. We choose $\lambda_j$ by tracking $\eta_j$ and $\tau_j$ in the Lasso path as follows.

The basic idea is to allow some overfitting of $x_j$ as long as $\tau_j$ and $\eta_j$ are reasonably
small. Let

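(10) $\hat\gamma_{-j}(\lambda) = \arg\min_b \{ |x_j - X_{-j} b|_2^2 / (2n) + \lambda |b|_1 \}$,
     $z_j(\lambda) = x_j - X_{-j} \hat\gamma_{-j}(\lambda)$,
     $\eta_j(\lambda) = \max_{k \ne j} |x_k^T z_j(\lambda)| / |z_j(\lambda)|_2 = n \lambda / |z_j(\lambda)|_2$,
     $\tau_j(\lambda) = |z_j(\lambda)|_2 / |x_j^T z_j(\lambda)|$

be the coefficient estimator $\hat\gamma_{-j}$, residual $z_j$, and factors $\eta_j$ and $\tau_j$ along the Lasso path for
regressing $x_j$ against $X_{-j}$. Let $\hat\sigma_j^*$ be the scaled Lasso estimate of the noise level in this linear
model at a penalty level $\lambda_j^*$,

(11) $\hat\sigma_j^* = \arg\min_\sigma \min_b \{ |x_j - X_{-j} b|_2^2 / (2 n \sigma) + \sigma / 2 + \lambda_j^* |b|_1 \}$.

If we used the scaled Lasso, for which $|z_j(\hat\sigma_j^* \lambda_j^*)|_2 = \hat\sigma_j^* n^{1/2}$, to calculate $\hat\gamma_{-j}$, we would pick $z_j(\hat\sigma_j^* \lambda_j^*)$
as $z_j$. This would provide factors $\tau_j(\hat\sigma_j^* \lambda_j^*)$ and $\eta_j(\hat\sigma_j^* \lambda_j^*) = n^{1/2} \lambda_j^*$, which equals
$\sqrt{2 \log p}$ when $\lambda_j^* = \lambda_{univ}$.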

To increase the accuracy of the coverage probability of the confidence interval for $\beta_j$,
we reduce the ratio of bias to standard error of our estimator in exchange for a controlled
increase in its standard error; specifically, we allow $\tau_j$ to increase by a factor of up to
$1 + \kappa_0$ in order to reduce $\eta_j$. Thus, with (10) and (11), we pick

(12) $z_j = z_j(\lambda_j)$, $\lambda_j = \arg\min_\lambda \{ \eta_j(\lambda) : \tau_j(\lambda) \le (1 + \kappa_0) \tau_j(\hat\sigma_j^* \lambda_j^*) \}$,

where $\kappa_0 > 0$ is a pre-determined constant. This yields $\tau_j = \tau_j(\lambda_j)$ and $\eta_j = \eta_j(\lambda_j)$ in (8).
The following proposition, proved in the appendix, summarizes some useful properties of
the procedure (10), (11), and (12).



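Proposition 1. Both functions $|z_j(\lambda)|_2$ and $\eta_j(\lambda)$ are nondecreasing in $\lambda$ and the function
$\tau_j(\lambda)$ is no greater than $1 / |z_j(\lambda)|_2$ in the Lasso path (10). Moreover, it holds in (12) that

(13) $\eta_j(\lambda_j) \le \eta_j(\hat\sigma_j^* \lambda_j^*) = n^{1/2} \lambda_j^*$, $\tau_j(\lambda_j) \le (1 + \kappa_0) / (\hat\sigma_j^* n^{1/2})$.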

Remark 1. Since $\eta_j(\lambda)$ is a nondecreasing function of $\lambda$, (12) can be carried out by
minimizing $\lambda$ under the constraint on $\tau_j(\lambda)$.

Proposition 1 allows us to set a specific $\lambda_j^*$ to control $\eta_j$ and a $\kappa_0$ to control the inflation
of the standard error from the scaled Lasso. Sensible choices include $\kappa_0 = 1/2$ and $\lambda_j^* =
\lambda_{univ} = \sqrt{(2/n) \log p}$ to guarantee $\eta_j \le \sqrt{2 \log p}$. Along with (10), (11), and (12), this gives
a complete description of a specific implementation. We would like to emphasize here that
given $\kappa_0$, the score vector $z_j$ in (12) is completely determined by the design matrix $X$.
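
The selection rule (12), in the form suggested by Remark 1, can be sketched on a finite grid of penalty levels. This is a simplified illustration under stated assumptions: the reference penalty `lam_ref` is a hypothetical stand-in for $\hat\sigma_j^* \lambda_j^*$ (which the paper obtains from the scaled Lasso in (11)), the grid replaces the exact Lasso path, and the helper names and toy sizes are invented for the example.

```python
import math
import random


def lasso_cd(X, y, lam, n_sweeps=300, tol=1e-12):
    """Coordinate descent for min_b |y - X b|_2^2 / (2n) + lam * |b|_1."""
    n, p = len(X), len(X[0])
    cols = [[X[i][k] for i in range(n)] for k in range(p)]
    sq = [sum(v * v for v in c) / n for c in cols]
    b, r = [0.0] * p, list(y)
    for _ in range(n_sweeps):
        biggest_move = 0.0
        for k, c in enumerate(cols):
            rho = sum(ci * ri for ci, ri in zip(c, r)) / n + sq[k] * b[k]
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / sq[k]
            step = new - b[k]
            if step != 0.0:
                r = [ri - step * ci for ri, ci in zip(r, c)]
                b[k] = new
                biggest_move = max(biggest_move, abs(step))
        if biggest_move < tol:
            break
    return b, r


def path_factors(x_j, X_minus_j, lam):
    """Residual z_j(lam) with the factors eta_j(lam) and tau_j(lam) of (10)."""
    _, z = lasso_cd(X_minus_j, x_j, lam)
    norm_z = math.sqrt(sum(v * v for v in z))
    width = len(X_minus_j[0])
    eta = max(abs(sum(row[k] * zi for row, zi in zip(X_minus_j, z)))
              for k in range(width)) / norm_z
    tau = norm_z / abs(sum(a * b for a, b in zip(x_j, z)))
    return eta, tau, z


def choose_lambda(x_j, X_minus_j, lam_ref, kappa0=0.5, n_grid=20):
    """Smallest grid penalty whose tau is within (1 + kappa0) of the reference."""
    _, tau_ref, _ = path_factors(x_j, X_minus_j, lam_ref)
    for t in range(1, n_grid + 1):              # increasing grid ending at lam_ref
        lam = lam_ref * t / n_grid
        eta, tau, z = path_factors(x_j, X_minus_j, lam)
        if tau <= (1 + kappa0) * tau_ref:
            return lam, eta, tau, tau_ref, z


random.seed(3)
n, p, j = 40, 12, 0                             # small toy sizes for speed
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
x_j = [row[j] for row in X]
X_minus_j = [row[:j] + row[j + 1:] for row in X]
lam_ref = math.sqrt(2.0 / n * math.log(p))      # hypothetical reference penalty
lam_j, eta_j, tau_j, tau_ref, z_j = choose_lambda(x_j, X_minus_j, lam_ref)
```

The last grid point is the reference penalty itself, which always satisfies the constraint, so the search returns a penalty no larger than the reference.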



We have also experimented with an LDPE with a restricted Lasso relaxation for $z_j$. This
restricted LDPE can be viewed as a special case of a more general weighted low-dimensional
projection with different levels of relaxation for different variables $x_k$ according to their
correlation to $x_j$. Although we have used (7) to bound the bias, the summands with larger
absolute correlation $|x_j^T x_k / n|$ are likely to have a greater contribution to the bias due to
the initial estimation error $|\hat\beta_k^{(init)} - \beta_k|$. A remedy for this phenomenon is to force smaller
$|z_j^T x_k / n|$ for large $|x_j^T x_k / n|$ with a weighted relaxation. For the Lasso this weighted
relaxation can be written as

$z_j = x_j - X_{-j} \hat\gamma_{-j}$, $\hat\gamma_{-j} = \arg\min_b \{ |x_j - X_{-j} b|_2^2 / (2n) + \lambda_j \sum_{k \ne j} w_k |b_k| \}$,

with $w_k$ being a decreasing function of the absolute correlation $|x_j^T x_k / n|$. For the restricted
LDPE, we simply set $w_k = 0$ for large $|x_j^T x_k / n|$ and $w_k = 1$ for other $k$.

Here is a scaled Lasso implementation of this restricted LDPE. Let $K_{j,m}$ be the index
set of the $m$ largest $|x_j^T x_k|$ with $k \ne j$ and $P_{j,m}$ be the orthogonal projection to the linear
span of $\{x_k, k \in K_{j,m}\}$. We modify the procedure (10), (11), and (12) by first taking the
projection of all design vectors to the orthogonal complement of $\{x_k, k \in K_{j,m}\}$:

(14) $z_j = f(P_{j,m}^\perp x_j, P_{j,m}^\perp X_{-j})$,

where $P_{j,m}^\perp = I - P_{j,m}$ and $f(x_j, X_{-j})$ denotes the $z_j$ in (12), explicitly as a function of $x_j$ and $X_{-j}$.



2.4. Confidence intervals. In Section 3, we will provide sufficient conditions on $X$ and $\beta$
under which the approximation error in (5) is of smaller order than the standard deviation of
the noise component. We construct approximate confidence intervals for such configurations
of $\{X, \beta\}$ as follows.

The covariance of the noise component in (5) is proportional to

(15) $V = (V_{jk})_{p \times p}$, $V_{jk} = \frac{z_j^T z_k}{|z_j^T x_j| \, |z_k^T x_k|} = \sigma^{-2} \mathrm{Cov}\Big( \frac{z_j^T \varepsilon}{|z_j^T x_j|}, \frac{z_k^T \varepsilon}{|z_k^T x_k|} \Big)$.

Let $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_p)^T$ be the vector of LDPEs $\hat\beta_j$ in (4). For example, we may choose $\hat\beta^{(init)}$
in (9) and $z_j$ in (12), (14), or (26) in the construction of $\hat\beta$. For sparse vectors $a$, e.g.
$|a|_0 = 2$ for a contrast between two regression coefficients, an approximate $(1 - \alpha) 100\%$
confidence interval is

(16) $|a^T \hat\beta - a^T \beta| \le \hat\sigma \Phi^{-1}(1 - \alpha/2) (a^T V a)^{1/2}$,

where $\hat\sigma$ is the scaled Lasso estimator of the noise level $\sigma$ in (9) and $\Phi$ is the standard normal
distribution function. An alternative, larger estimate of $\sigma$, which produces more
conservative approximate confidence intervals, is the penalized maximum likelihood estimator of
[SBvdG10].
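
The interval (16) only involves the entries of $V$ on the support of $a$. The following sketch computes it for a toy orthogonal design; the function name `ldpe_interval`, the input values, and the choice $z_j = x_j$ are hypothetical and serve only to make the arithmetic easy to follow.

```python
import math
from statistics import NormalDist


def dot(u, v):
    return sum(s * t for s, t in zip(u, v))


def ldpe_interval(a, beta_hat, Z, X, sigma_hat, alpha=0.05):
    """Approximate (1 - alpha) interval (16) for a'beta.
    Z[j] is the score vector z_j; X is a list of rows."""
    support = [j for j, aj in enumerate(a) if aj != 0.0]
    col = {j: [row[j] for row in X] for j in support}
    zx = {j: abs(dot(Z[j], col[j])) for j in support}
    # a'Va with V_jk = z_j'z_k / (|z_j'x_j| |z_k'x_k|), as in (15)
    a_v_a = sum(a[j] * a[k] * dot(Z[j], Z[k]) / (zx[j] * zx[k])
                for j in support for k in support)
    half = sigma_hat * NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(a_v_a)
    center = sum(a[j] * beta_hat[j] for j in support)
    return center - half, center + half


# Toy orthogonal design with |x_k|_2^2 = n = 4, so that z_j = x_j is a valid score.
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
Z = [[row[0] for row in X], [row[1] for row in X]]
beta_hat = [2.0, 0.5]                 # hypothetical LDPE values
a = [1.0, -1.0]                       # contrast between the two coefficients
lo, hi = ldpe_interval(a, beta_hat, Z, X, sigma_hat=1.0)
```

In this toy case $V$ is diagonal with entries $1/4$, so $a^T V a = 1/2$ and the half-width is $\Phi^{-1}(0.975) \sqrt{1/2}$.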



3. Theoretical Results 

We show that when the $\ell_1$ loss of the initial estimator $\hat\beta^{(init)}$ is of an expected
magnitude and the noise level estimator $\hat\sigma$ is consistent, the LDPE based confidence interval
has approximately the preassigned coverage probability under a capped $\ell_1$ relaxation of the
sparsity condition $|\beta|_0 \le s$, provided that $s \log p \ll n^{1/2}$. We review proper conditions on
$X$ and $\beta$ under which such convergence of $\hat\beta^{(init)}$ and $\hat\sigma$ has already been established. We
prove that for certain random design matrices, the width of the confidence interval is of the
order $n^{-1/2}$.

3.1. Deterministic designs. Here we establish the asymptotic normality of the LDPE (4)
and the validity of the resulting confidence interval (16) for deterministic design matrices.
Let $\lambda_{univ} = \sqrt{(2/n) \log p}$. Suppose $\beta$ is sparse in the sense of

(17) $\sum_{j=1}^p \min\{ |\beta_j| / (\sigma \lambda_{univ}), 1 \} \le s$.

This condition holds if $\beta$ is $\ell_0$ sparse with $|\beta|_0 \le s$ or $\ell_q$ sparse with $|\beta|_q^q / (\sigma \lambda_{univ})^q \le s$,
$0 < q \le 1$. A generic condition we impose on the initial estimator is



(18) $P\{ |\hat\beta^{(init)} - \beta|_1 > C_1 s \sigma \sqrt{(2/n) \log(p/\epsilon)} \} \le \epsilon$

for a certain fixed constant $C_1$ and all $\alpha_0 / p^2 \le \epsilon \le 1$, where $\alpha_0 \in (0, 1)$ is a preassigned
significance level. By requiring a fixed $C_1$, we implicitly impose regularity conditions on
the design $X$ and the sparsity index $s$ in (17). Existing oracle inequalities can be used to
verify (18) for various regularized estimators of $\beta$ under different sets of conditions on $X$
and $\beta$ [CT07, ZH08, BRT09, vdGB09, Zha09, Zha10, YZ10, SZ11, ZZ11]. Although most
existing results are derived for penalty/threshold levels depending on the noise level $\sigma$ and
under the $\ell_0$ sparsity condition, their proofs can be combined or extended to obtain (18).
We also impose a similar generic condition on an estimator $\hat\sigma$ for the noise level:

(19) $P\{ |\hat\sigma / \sigma - 1| > C_2 s (2/n) \log(p/\epsilon) \} \le \epsilon$,

with fixed $C_2$ and all $\alpha_0 / p^2 \le \epsilon \le 1$. We use the same $\epsilon$ in (18) and (19) without much
loss of generality. For the joint estimation of $\{\beta, \sigma\}$ with the scaled Lasso (9), a specific set of
sufficient conditions for (18) and (19), based on [SZ11], will be stated in Subsection 3.2. In
fact, the probability of the union of the two events is smaller than $\epsilon / \max\{1, n^{1/2} \lambda_0\}$ in this
specific case with $\lambda_0 = A \sqrt{(2/n) \log(p/\epsilon)}$ for certain $A > 1$.



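Theorem 1. Suppose (17) holds. Let $\hat\beta_j$ be given by (4) with a $z_j$ depending on $X$ only
and an initial estimator $\hat\beta^{(init)}$. Let $\max(\epsilon_n', \epsilon_n'') \to 0+$, $\tau_j = |z_j|_2 / |x_j^T z_j|$, and
$\eta_j = \max_{k \ne j} |x_k^T z_j| / |z_j|_2$. Suppose (18) holds and $\eta_j C_1 s \sigma \sqrt{(2/n) \log(p/\epsilon)} \le \epsilon_n'$ for all $j$. Then,

(20) $\max_j P\{ |\tau_j^{-1} (\hat\beta_j - \beta_j) - z_j^T \varepsilon / |z_j|_2| > \epsilon_n' \} \le \epsilon$.

If in addition (19) holds with $C_2 s (2/n) \log(p/\epsilon) \le \epsilon_n''$, then for all $j \le p$ and $t \in \mathbb{R}$,

(21) $\Phi(t - \epsilon_n' - \epsilon_n'' |t|) - 2\epsilon \le P\{ \tau_j^{-1} (\hat\beta_j - \beta_j) \le \hat\sigma t \} \le \Phi(t + \epsilon_n' + \epsilon_n'' |t|) + 2\epsilon$.

Consequently, for the covariance matrix $V$ in (15) and all fixed $m$,

(22) $\lim_{n \to \infty} \inf_{|a|_0 \le m} P\{ |a^T \hat\beta - a^T \beta| \le \hat\sigma \Phi^{-1}(1 - \alpha/2) (a^T V a)^{1/2} \} = 1 - \alpha$.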

Remark 2. In our implementation (12) with $\lambda_j^* = \lambda_{univ}$, $z_j$ is the residual of the Lasso
estimator in the regression model for $x_j$ against $X_{-j} = (x_k, k \ne j)$ with a penalty level $\lambda_j$ to
guarantee $\eta_j \le \sqrt{2 \log p}$. Thus, the dimension constraint for the asymptotic normality and
proper coverage probability in Theorem 1 is $s (\log p) / \sqrt{n} \to 0$. It is expected from existing
theoretical results for the Lasso and Proposition 1 that $\tau_j \le 1 / |z_j|_2 \asymp n^{-1/2}$ in (12).

Since $(z_j^T \varepsilon / |z_j|_2, j \le p)$ has a multivariate normal distribution with identical marginal
distributions $N(0, \sigma^2)$, (20) establishes the joint asymptotic normality of the LDPE for
finitely many $\beta_j$ under (18). Under the additional condition (19), (21) and (22) justify the
approximate coverage probability of the resulting confidence intervals. Sufficient conditions
for (18) and (19) are given in Subsection 3.2, and the convergence rate $\tau_j \le 1 / |z_j|_2 \asymp n^{-1/2}$
is verified in Subsection 3.3.

3.2. Oracle inequalities. We focus here on the scaled Lasso (9) as a specific choice of the
initial estimator, since the confidence interval in Theorem 1 is based on the joint estimation
of regression coefficients and noise level.

Let $\xi > 1$ and $S = \{j : |\beta_j| > \sigma \lambda_{univ}\}$. Define a sign-restricted cone invertibility factor

(23) $\mathrm{SCIF}(\xi, S) = \inf\{ |X^T X u|_\infty |S| / (n |u_S|_1) : |u_{S^c}|_1 \le \xi |u_S|_1, \ u_j x_j^T X u \le 0, \ j \notin S \}$

as in [YZ10]. Since $|X^T X u|_\infty |S| / (n |u_S|_1) \ge |X u|_2^2 |S| / (n |u_S|_1^2)$ for vectors $u$ satisfying the
restrictions in (23), $\mathrm{SCIF}(\xi, S)$ is a slightly larger quantity than the compatibility factor
[vdGB09]. The following theorem is a consequence of Theorem 2 in [SZ11].

Theorem 2. Let $\{A, \xi, c_0\}$ be fixed positive constants with $\xi > 1$ and $A > 1$. Let
$\hat\beta^{(init)}$ and $\hat\sigma$ be as in (9) with $\lambda_0 = A \sqrt{(2/n) \log(p/\epsilon)}$. Suppose $\min_{|S| \le s} \mathrm{SCIF}(\xi, S) \ge c_0$
and (17) holds with $s (\log p) / n \to 0$ as $n \to \infty$. Then, for sufficiently large $n$, (18) and (19)
hold with $C_1$ and $C_2$ depending on $\{A, \xi, c_0\}$ only. Consequently, (20), (21), and (22) hold.



The main condition of Theorem 2 is $\min_{|S| \le s} \mathrm{SCIF}(\xi, S) \ge c_0$ for $s (\log p) / n \to 0$. This
condition holds if the sparse Riesz condition holds: $c_* \le \mathrm{eigenvalues}(X_S^T X_S / n) \le c^*$ for all
$|S| \le (c^* / c_* + 1/2) s$ [ZH08, YZ10, Zha10]. This main condition of Theorem 2 holds for a
certain class of random design matrices $X$ described in Subsection 3.3.



3.3. Random designs. We assume in this subsection that $X = (x_1, \ldots, x_p)$ is a column
normalized version of a Gaussian random matrix $\tilde X$,

(24) $x_j = \tilde x_j \sqrt{n} / |\tilde x_j|_2$, $\tilde X = (\tilde x_1, \ldots, \tilde x_p)$ has iid $N(0, \Sigma)$ rows.

We assume without loss of generality that the diagonal elements of $\Sigma$ all equal 1. In this
setting, $x_j$ is related to other design variables $X_{-j}$ through a linear model

(25) $x_j = X_{-j} \gamma_{-j} + \varepsilon^{(j)}$, $|\tilde x_j|_2 n^{-1/2} \varepsilon^{(j)} \sim N(0, \sigma_j^2 I_{n \times n})$.

This model is a motive for the use of the Lasso in (10), (11), and (12). However, the goal of
the procedure is to find $z_j$ with suitable $\tau_j$ and $\eta_j$ for controlling the variance and bias of
the LDPE (4) as in Theorem 1. The following theorem justifies low-dimensional statistical
inference based on LDPE (4) for the random design (24) by verifying all conditions of
Theorems 1 and 2 under the sparsity condition (17) with $s (\log p) / \sqrt{n} \to 0$. Define a class
of vectors with small $\ell_q$ tail as

$\mathcal{B}_q(s, \lambda) = \{ b \in \mathbb{R}^p : \sum_{j=1}^p \min(|b_j|^q / \lambda^q, 1) \le s \}$.



Theorem 3. Suppose (1) and (24) hold with $\mathrm{diag}(\Sigma) = I_{p \times p}$ and $\mathrm{eigenvalues}(\Sigma) \subset [c_*, c^*]$,
where $0 < c_* < c^* < \infty$ are fixed. Let $\hat\beta^{(init)}$ and $\hat\sigma$ be as in (9) with $\lambda_0 = A \sqrt{(2/n) \log(p/\epsilon)}$
for a certain fixed $A > 1$. Let $z_j$ be as in (12) with $\lambda_j^* \asymp \lambda_0$ and $\hat\beta_j$ be as in (4). Suppose
(17) holds with $s (\log p) / n^{1/2} \to 0$. Then, $\min_{|S| \le s} \mathrm{SCIF}(\xi, S) \ge c_0$ in (23) with fixed $c_0 > 0$.
Consequently, (20), (21) and (22) hold as in Theorem 1. Moreover, if $s^* (\log p) / n \to 0$, then

$\tau_j \lesssim n^{-1/2}$ whenever $\gamma_{-j} \in \mathcal{B}_2(s^*, \lambda_0)$.

[Figure 1: histogram panels titled "Scaled LASSO error" and "LDPE error"; horizontal axis: error.]

Figure 1. Histogram of errors when estimating maximal $\beta_j$ using the scaled
Lasso and LDPE, in simulation settings (A), (B), (C), and (D), from left to
right.



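Remark 3. It follows from the block matrix inversion formula that in (25)
$\gamma_{-j} = \big( -\{(\Sigma^{-1})_{jk} / (\Sigma^{-1})_{jj}\} \, |\tilde x_k|_2 / |\tilde x_j|_2, \ k \ne j \big)^T$, $\sigma_j^2 = 1 / (\Sigma^{-1})_{jj}$.
Since $|\tilde x_j|_2^2 \sim \chi_n^2$, the condition $\gamma_{-j} \in \mathcal{B}_2(s^*, \lambda_0)$ in Theorem 3 can be replaced by
$(-(\Sigma^{-1})_{jk} / (\Sigma^{-1})_{jj}, \ k \ne j) \in \mathcal{B}_2(s^*, \lambda_0)$. This does not add much restriction to the
condition that $\mathrm{eigenvalues}(\Sigma) \subset [c_*, c^*]$.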

4. Simulation Results 



Our simulation design is set to clearly violate the assumptions of the theorems in
Section 3. We set $n = 200$, $p = 3000$, $\lambda_{univ} = \sqrt{(2/n) \log p}$, $\beta_j = 3 \lambda_{univ}$ for
$j = 1500, 1800, 2100, \ldots, 3000$, and $\beta_j = 3 \lambda_{univ} / j^\alpha$ for all other $j$. This gives
$(s, s (\log p) / \sqrt{n}) = (8.93, 5.05)$ and $(29.24, 16.55)$ respectively for $\alpha = 2$ and $1$, while the theorems require
$s (\log p) / \sqrt{n} \to 0$, where $s = \sum_j \min(|\beta_j| / \lambda_{univ}, 1)$. We run simulation experiments with
100 replications in each setting. In each replication, we generate an independent copy of
$(\tilde X, X, y)$, where $\tilde X = (\tilde x_{ij})_{n \times p}$ has iid $N(0, \Sigma)$ rows with $\Sigma = (\rho^{|j-k|})_{p \times p}$, $x_j = \tilde x_j \sqrt{n} / |\tilde x_j|_2$,
and $(X, y)$ as in (1) with $\sigma = 1$. This example includes four cases, labeled (A), (B), (C),
and (D), respectively: $(\alpha, \rho) = (2, 1/5)$, $(1, 1/5)$, $(2, 4/5)$, and $(1, 4/5)$, with case (D) being
the most difficult one.
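
The sparsity index of this design can be recomputed directly from the definition of $\beta$. The short sketch below is an illustration (the function name `sparsity_index` is hypothetical); it evaluates $s = \sum_j \min(|\beta_j| / \lambda_{univ}, 1)$ and $s (\log p) / \sqrt{n}$ for a given decay exponent $\alpha$.

```python
import math


def sparsity_index(alpha, n=200, p=3000):
    """Return (s, s * log(p) / sqrt(n)) for the simulation design with sigma = 1."""
    lam_univ = math.sqrt(2.0 / n * math.log(p))
    large = set(range(1500, 3001, 300))          # j = 1500, 1800, ..., 3000
    beta = [3 * lam_univ if j in large else 3 * lam_univ / j ** alpha
            for j in range(1, p + 1)]
    s = sum(min(abs(b) / lam_univ, 1.0) for b in beta)
    return s, s * math.log(p) / math.sqrt(n)


s_fast, ratio_fast = sparsity_index(alpha=2)     # settings (A) and (C)
s_slow, ratio_slow = sparsity_index(alpha=1)     # settings (B) and (D)
```

Up to rounding in the second decimal place, the computed values agree with the pairs reported above for $\alpha = 2$ and $\alpha = 1$.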

In addition to the Lasso with penalty level $\lambda = \lambda_{univ}$ and the scaled Lasso (9), we consider
the LDPE and its restricted version with (14). For both the LDPE and restricted LDPE,

     Estimator       Lasso    scaled Lasso      LDPE
(A)  mean          -0.2946         -0.4601   -0.0149
     variance       0.0090          0.0187    0.0175
     median         0.2941          0.4459    0.0885
(B)  mean          -0.2998         -0.5392   -0.0075
     variance       0.0108          0.0239    0.0239
     median         0.2949          0.5296    0.0993
(C)  mean          -0.3009         -0.4406   -0.0184
     variance       0.0135          0.0229    0.0280
     median         0.3054          0.4411    0.1088
(D)  mean          -0.3214         -0.5514   -0.0290
     variance       0.0193          0.0351    0.0406
     median         0.3179          0.5581    0.1315

Table 1. Summary statistics for the errors of the Lasso with fixed penalty
$\lambda = \lambda_{univ}$, the scaled Lasso, and LDPE estimates of the seven maximal
$\beta_j = |\beta|_\infty$.



the scaled Lasso is used to generate $\hat\beta^{(init)}$ and $\hat\sigma$, and the procedure (10), (11), and (12)
is used to generate $z_j$, with $\kappa_0 = 1/2$ and $\lambda_j^* = \lambda_{univ}$ to guarantee $\eta_j \le \sqrt{2 \log p}$. For the
restricted LDPE, $m = 20$ is used in (14).



The asymptotic normality of the LDPE holds well in our simulation experiments. Table 1
and Figure 1 demonstrate the behavior of the LDPE for the largest $\beta_j$, compared with the
Lasso. For a small increase in variance, the LDPE significantly reduces the bias of the Lasso
and scaled Lasso estimates. The scaled Lasso has more bias than the Lasso, but is entirely
data-driven. These results hold over all simulation settings. The LDPE shows the largest
improvement in performance when estimating large $\beta_j$. Although the asymptotic normality
of the LDPE holds even better for small $\beta_j$ in the simulation study, a parallel comparison
for small $\beta_j$ is not meaningful; the Lasso typically estimates small $\beta_j$ by zero, while the raw
LDPE is not designed to be sparse.



Table 2. Mean coverage for LDPEs for all $\beta_j$

              (A)       (B)       (C)       (D)
coverage   0.9578    0.9602    0.9502    0.9501

The overall coverage probability of the LDPE-based confidence interval matches well to
the preassigned level, as expected from the asymptotic normality. The LDPE creates
confidence intervals $\hat\beta_j \pm 1.96 \hat\sigma \tau_j$ with approximately 95% coverage. Refer to Table 2 for precise



C.-H. Zhang and S. S. Zhang 



[Figure 2: top panels titled "Coverage vs index" (horizontal axis: index, 1000 to 3000); bottom panels
titled "Coverage over 100 trials" (horizontal axis: coverage probability).]

Figure 2. Top: Relative coverage frequencies versus the index of $\beta_j$. Bottom:
The percentage of variables for given values of the relative coverage
frequency, superimposed with the binomial(100, 0.95) probability mass function.
Figures depict results from simulations (A), (B), (C), and (D), from
left to right.



values. Moreover, the empirical distribution of the simulated relative coverage frequencies
matches that of the binomial(100, 0.95) very well in all four settings, as shown in the bottom
row of Figure 2.

Although the coverage probability is satisfactory over all $\beta_j$, we have some degree of
under-coverage when large values of $\beta_j$ are associated with highly correlated columns of $X$.
This is most apparent when plotting coverage vs. index in simulation (D), but is also visible
in simulation (C). In the top row of Figure 2, the range of simulated relative coverage
frequencies in settings (C) and (D) is considerably larger, due to a few instances of low
coverage for the smallest indices. Note that these are the two simulations with higher
correlation between adjacent columns of $X$, and the first few $\beta_j$ are all relatively large.

The restricted LDPE (14) reduces the bias caused by relatively large values of $\beta_j$
associated with highly correlated columns of $X$. Figure 3 shows that this reduces the bias
originating from other large values of $\beta$ and improves coverage probabilities, at the cost of
an increase in the variance of the estimator.

We may also consider the performance of LDPE as a point estimator. Figure 4 shows the
behavior of the median absolute error for the estimation of $\beta_j$. The Lasso and scaled Lasso

Figure 3. Plots of the median of $\hat\beta_j$ (top) and relative coverage frequencies
(bottom) over 100 replications for the LDPE (left) and the restricted LDPE
(right). The restricted LDPE has smaller bias and more accurate coverage
probabilities, but larger variability.



estimators have large biases for bigger values of $\beta_j$ but perform very well for smaller values.
On the other hand, the median absolute error for the LDPE is very stable since it is not
designed to be sparse without post processing. The LDPE performs significantly better than the
scaled Lasso (the initial estimator of the LDPE) for large $\beta_j$, but the Lasso outperforms the LDPE
as a point estimator for smaller $\beta_j$. This can be adjusted by thresholding the LDPE.

Once we have thresholded, we may consider the loss of the LDPE as an estimate of
the vector $\beta$. The mean, standard deviation, and median loss over 100 replications are
listed in Table 3. In the simplest case, (A), the thresholded LDPE outperforms both the
Lasso and the scaled Lasso in terms of loss. The mean and median loss for the Lasso
and LDPE are comparable in simulations (B) and (C), with both outperforming the scaled

Figure 4. Plots of median absolute error for scaled Lasso and LDP estimators,
in simulations (A) and (D). The patterns in simulations (B) and (C)
are similar to those in (A) and (D), respectively.

Lasso. However, the standard deviation of the loss is larger for the LDPE. In the hardest
case, (D), which has both a high correlation between adjacent columns of $X$ and a slower
decay for $\beta$, the performance of the LDPE is similar to that of the scaled Lasso, with the
Lasso with fixed penalty $\lambda = \sigma \lambda_{univ}$ outperforming both data-driven estimators.

5. Discussion 

We have developed the LDPE method of constructing $\hat\beta_1, \ldots, \hat\beta_p$ for the individual
regression coefficients and estimators for their finite dimensional covariance structure. Under
proper conditions on $X$ and $\beta$, we have proven the asymptotic unbiasedness and normality
of the finite dimensional distribution functions of these estimators and the consistency of
their estimated covariances. This allows one to assess the level of significance of each
unknown coefficient $\beta_j$ more accurately than commonly used estimators of the entire vector
$\beta$. The raw LDPE estimator is not sparse, but it can be thresholded to take advantage
of the sparsity of $\beta$. Compared with existing variable selection methods, thresholding the
LDPE allows selection of large regression coefficients in the presence of many small
regression coefficients, and the sampling distribution of the thresholded LDPE can be bounded

     Estimator     Lasso    scaled Lasso    T-LDPE
(A)  mean         0.7235          1.6693    0.3072
     sd           0.1965          0.6738    0.3599
     median       0.6918          1.4946    0.1741
(B)  mean         1.0023          2.6109    1.0233
     sd           0.2286          0.8202    0.7669
     median       0.9584          2.4635    0.6619
(C)  mean         0.7796          1.5717    0.9990
     sd           0.2250          0.5952    0.9033
     median       0.7390          1.4266    0.8426
(D)  mean         1.1555          2.6811    2.9132
     sd           0.3056          0.7694    1.3340
     median       1.1349          2.7298    2.8206

Table 3. Summary statistics of loss for the Lasso with $\lambda = \sigma \lambda_{univ}$,
scaled Lasso, and thresholded LDPE (T-LDPE) estimates of $\beta$


based on our theoretical results. For example, we allow $\epsilon$ to range in $[\alpha_0 / p^2, 1]$ in Section 3
to cover further calculations involving Bonferroni and other multiplicity adjustments.

In this paper, we use the Lasso to provide a relaxation of the projection of $x_j$ to $x_j^\perp$.
This choice is primarily due to our familiarity with the computation of the Lasso and the
readily available scaled Lasso method of choosing a penalty level. We have also considered
some other methods of relaxing the projection. Among these other methods, a particularly
interesting one is the following constrained minimization of the variance of the noise term
in (5):

(26) $z_j = \arg\min_z \{ |z|_2^2 : |z^T x_j| = n, \ \max_{k \ne j} |z^T x_k / n| \le \lambda_j' \}$.

Similar to the Lasso in (6), (26) is quadratic programming. The Lasso solution (6), rescaled so
that $|z_j^T x_j| = n$, is feasible in (26) with $\lambda_j n / |z_j^T x_j| = \lambda_j'$. Our results on these and other extensions
of our ideas and methods will be presented in a forthcoming paper.



6. Appendix 

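Proof of Proposition 1. Since $\hat\gamma_{-j}(\lambda)$ is continuous and piecewise linear in $\lambda$, it
suffices to consider a fixed open interval $\lambda \in I_0$ in which $A = \{k \ne j : \hat\gamma_k(\lambda) \ne 0\}$ and
$s_A = \mathrm{sgn}(\hat\gamma_A(\lambda))$ do not change with $\lambda$. It follows from the Karush-Kuhn-Tucker conditions
for the Lasso that

$X_A^T \{x_j - X_A \hat\gamma_A(\lambda)\} / n = X_A^T \{x_j - X_{-j} \hat\gamma_{-j}(\lambda)\} / n = \lambda s_A$.

This gives $(\partial / \partial\lambda) \hat\gamma_A(\lambda) = -(X_A^T X_A / n)^{-1} s_A$ for all $\lambda \in I_0$. It follows that

$(\partial / \partial\lambda) |x_j - X_{-j} \hat\gamma_{-j}(\lambda)|_2^2 = -2 \{(\partial / \partial\lambda) \hat\gamma_A(\lambda)\}^T X_A^T (x_j - X_A \hat\gamma_A(\lambda))$
    $= 2 \{(X_A^T X_A / n)^{-1} s_A\}^T X_A^T (x_j - X_A \hat\gamma_A(\lambda))$
    $= (2 / \lambda) |P_A (x_j - X_A \hat\gamma_A(\lambda))|_2^2$,

where $P_A = X_A (X_A^T X_A)^{-1} X_A^T$ is the projection to the column space of $X_A$. Thus,
$|z_j(\lambda)|_2$ is nondecreasing in $\lambda$. Moreover,

$(\lambda^3 / 2) (\partial / \partial\lambda) \{\eta_j(\lambda) / n\}^{-2} = (\lambda^3 / 2) (\partial / \partial\lambda) \{ \lambda^{-2} |x_j - X_A \hat\gamma_A(\lambda)|_2^2 \}$
    $= |P_A \{x_j - X_A \hat\gamma_A(\lambda)\}|_2^2 - |x_j - X_A \hat\gamma_A(\lambda)|_2^2 \le 0$,

so that $\eta_j(\lambda)$ is nondecreasing in $\lambda$. Since $x_j^T z_j(\lambda) = |z_j(\lambda)|_2^2 + \{X_{-j} \hat\gamma_{-j}(\lambda)\}^T z_j(\lambda) =
|z_j(\lambda)|_2^2 + n \lambda |\hat\gamma_{-j}(\lambda)|_1$, we also have $\tau_j(\lambda) \le 1 / |z_j(\lambda)|_2$. Finally, (13) follows since
$\lambda_j \le \hat\sigma_j^* \lambda_j^*$ and $|z_j(\hat\sigma_j^* \lambda_j^*)|_2 = \hat\sigma_j^* n^{1/2}$. □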


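Proof of Theorem 1. Since $z_j$ is determined by $X$, (5) and (7) imply

$\big| \tau_j^{-1} (\hat\beta_j - \beta_j) - z_j^T \varepsilon / |z_j|_2 \big| \le \frac{\max_{k \ne j} |z_j^T x_k|}{|z_j|_2} |\hat\beta^{(init)} - \beta|_1 = \eta_j |\hat\beta^{(init)} - \beta|_1$.

This and (18) yield (20) as well as the validity of $V$ in (15) as the approximate covariance
between $\hat\beta_j$ and $\hat\beta_k$. The rest of the Theorem then follows directly from (19). □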

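Proof of Theorem 2. Let $\sigma^* = |\varepsilon|_2 / \sqrt{n}$. It follows from Theorem 2 in [SZ11] that

$P\{ |\hat\beta^{(init)} - \beta|_1 \le C_0 \sigma^* s \lambda_0, \ |\hat\sigma / \sigma^* - 1| \le C_0 s \lambda_0^2 \} \ge 1 - \epsilon$,

with a constant $C_0$ depending on $\{A, \xi, c_0\}$ only. Since $n (\sigma^* / \sigma)^2$ has the $\chi_n^2$ distribution,
(18) and (19) hold with $C_1$ and $C_2$ slightly different from $C_0$. □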

Proof of Theorem 3. Since the eigenvalues of $\Sigma$ are uniformly bounded, $\min_{|S| \le s} \mathrm{SCIF}(\xi, S)$
is uniformly bounded away from zero for small $|S| (\log p) / n$ [ZH08, YZ10]. This implies the
condition of Theorem 2.

Consider a fixed $j$. Let $\hat\sigma_j$ be the scaled Lasso estimator of $\sigma_j$ in (25) with penalty
level $\lambda_0$ and $S_j$ be the index set of the elements of $\gamma_{-j}$ with absolute value $\sigma_j \lambda_0$ or larger.
Since $E |X_{S_j^c} \gamma_{S_j^c}|_2^2 / n \le O(c^*) |\gamma_{S_j^c}|_2^2 \le O(s^* \lambda_0^2) = o(1)$ and the SCIF is uniformly bounded
away from zero, Theorem 1 of [SZ11] implies $\hat\sigma_j / \sigma_j = 1 + o(1)$. Thus, in the Lasso path
(10), $\tau_j(\hat\sigma_j \lambda_0) \le 1 / |z_j(\hat\sigma_j \lambda_0)|_2 = 1 / (\hat\sigma_j n^{1/2}) \asymp n^{-1/2}$.

Now consider (12) with penalty level $\lambda_j^* \asymp \lambda_0$. By Proposition 1, both $|z_j(\lambda)|_2$ and
$\eta_j(\lambda) = n \lambda / |z_j(\lambda)|_2$ are nondecreasing in $\lambda$, so that $\hat\sigma_j \asymp \hat\sigma_j^*$ in (11). Consequently,
$\tau_j(\lambda_j) \lesssim n^{-1/2}$ by Proposition 1. □